Understanding the Logical and Semantic Structure of Large Documents
Abstract
Current language understanding approaches focus on small documents, such as newswire articles, blog posts, product reviews and discussion forum entries. Understanding and extracting information from large documents like legal briefs, proposals, technical manuals and research articles is still a challenging task. We describe a framework that can analyze a large document and help people find where particular information is located in that document. We aim to automatically identify and classify semantic sections of documents and assign consistent and human-understandable labels to similar sections across documents. A key contribution of our research is modeling the logical and semantic structure of an electronic document. We apply machine learning techniques, including deep learning, in our prototype system. We also make available a dataset of information about a collection of scholarly articles from the arXiv e-prints collection that includes a wide range of metadata for each article, including a table of contents, section labels, section summaries and more. We hope that this dataset will be a useful resource for the machine learning and NLP communities in information retrieval, content-based question answering and language modeling.
Similar Resources
A New Text-Mining Method for Extracting User Context Information to Improve the Ranking of Search Engine Results
Today, the importance of text processing and its uses is well known among researchers and students. The amount of textual and documentary material increases day by day, so we need useful ways to store it and retrieve information from it. For example, search engines such as Google, Yahoo and Bing need to read a great many web documents and retrieve the ones most similar to the user ...
Modern literary interpretation in understanding the meaning of the verse ‘There is nothing like Him’
Numerous views have been expressed by commentators and writers about the literary aspect and the meaning of the Qur'anic phrase "There is nothing like Him". The sequence of the words "ka" and "like" in the holy verse has led to two literary and semantic illusions. The literary illusion is that "ka" seems to be redundant and the semantic illusion is that the word ‘like’ indirectly proves the...
Semantic Indexing of Technical Documentation
This research takes place in an industrial context: the CONTINEW Company. This company ensures the storage and security of critical data and technical documentation. Consequently, it is necessary to organize these documents in order to quickly retrieve critical information. The management of this increasing volume of documents requires document classification which is based on indexing techniqu...
A Document Reuse Tool for Communities of Practice
With the rise of the Internet, virtual communities of practice are gaining importance as a means of sharing and exchanging information. In such environments, information reuse is of major concern. In this paper, we outline the importance of enriching documents with structural and semantic information in order to facilitate their reuse. We propose a framework for document reuse based on an explic...
A Model for Conformance Analysis of Software Documents
During the evolution of a large-scale software project, developers produce a large variety of software artifacts such as requirement specifications, design documents, source code, documentation, bug reports, etc. These software documents are not isolated items — they are semantically related to each other. They evolve over time and the set of active semantic relationships among them is also dyn...
Journal title:
- CoRR
Volume abs/1709.00770
Publication date: 2017